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On estimation states of hidden markov models 
in condition of unknown transition matrix 

Vasily Vasilyev, Alexander Dobrovidov 


Abstract —In this paper, we develop methods of nonlinear 
filtering and prediction of an unobservahle Markov chain with 
a finite set of states. This Markov chain controls coefficients of 
AR(p) model. Using observations generated by AR(p) model we 
have to estimate the state of Markov chain in the case of an 
unknown probability transition matrix. Comparison of proposed 
non-parametric algorithms with the optimal methods in the case 
of the known transition matrix is carried out by simulating. 


where Pr(5'„ = m \ Vf) is a posterior probability with 
respect to a u-algebra, generated by r.v. V". Its realization 
will be denoted by 

= m I Vi" = xl) = P{Sn = m I x^), (4) 

where we will write Xi instead of Vf = a;". 


Index Terms —hidden markov models, statistical signal pro¬ 
cessing, filtering and prediction, optimization problem, kernel 

density estimation. A. Basic equations 


I. Introduction 


H idden markov models are very popular for modeling 
and simulating processes, when you do not observe... 


II. System Model 

Let {SmXn) be a two-component process, where {Sn) is 
unobservable component and (V„) is observable one, n S 
{1,2,... ,N}, A S N; {Sn) “controls” equation coefficients 
of (V„). Let {Sn) be a stationary Markov chain with M 
discrete states and transition matrix ||pij ||, Pi,j = Pr(S'n = 
j I Sn-i = i)- The process (V„) is described by the 
autoregressive model of order p\ 


In this paper we consider methods of filtering and prediction 
in the case of unknown parametres (transition matrix) of 
process (S'„) and known parametres (equation coefficients 
in ([TJ) of process (V„). For comparison with some standard 
we also consider optimal filtering and prediction, where all 
parametres are known. 

Filtering is a problem to estimate Sn by using V". There¬ 
fore basic equations for filtering 


P{Sn =m\x^) 

_ f{Xn I Sn = 

fixu I a^r') 


P{Sn = m\x^-^), (5) 


p 

Xn = p{Sn) + X ai{Sn){Xn-i “ p{Sn)) + (L 

i=l 

where {^n} are i.i.d. random variables with the standard 
normal distribution, p,ai,b G R are coefficients controlled 
by the process {Sn)- 

As a quality measure for our methods we use mean risk 
E{L{Sn, Sn)) with a simple loss function L: 


L{Sn, Sn) = 


1, Sn ^ Sn 

0 , Sn = Sn 


( 2 ) 


where Sn = Sn{X[') is an estimator of and Vf = 
(Vi,V2,...,V„). 

As known, for this risk function with the loss function (|2]l 
the optimal estimator is 


Sn = argmax Pr(5'„ = m | V”), (3) 
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f(Xn I X^ ^) 

M 

= X I ^ rn,x'^~~^)P{Sn = m \ (6) 

m—1 

can be obtained from the total probability formula. Since 
coefficients in ([T]l are known and ^ V(0, 1) then 

f{Xn I Sn = m,Xi~^) 

= f{Xn\ Sn=m,x'!^Zl) = fm{Xn), ( 7 ) 

where 

P 

= 4>(^Xn-,p{m) + ^ ai{m){xn-i - p.{m)),h^{m)^ (8) 

i=l 

with normal probability density function 

where a;, ^ G R, .u G R+. 
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HI. Optimal Filtering 

In the optimal filtering all parametres are known. We use © 
knowing coefficients in ([T]) and calculate P{Sn = m \ 
in © knowing transition matrix: 

M 

P{Sr, = m I X^-^) = Y,P^,mP{Sn-l = i \ (10) 

Then the Q is transformed to the evaluation equation |[T1 


M 


fm{Xn) Y. Pi,mP{Sn-l = ^ | ) 


P{Sn = m\x^) = 


2=1 


M M 

E fj{Xu) E Pi,jP{Sn-l = i I X^~^) 

j=l i=l 


which will be considered as the optimal standard. 


is simplex. Let us rewrite estimator u„ with more detailes: 

u„ = argmin/i - 2 I 2 + I 3 , 
uGAm 


where 


h 

1 2 

1 3 


+ 00 

= J P{zn I 

— 00 

M 

— I ^ ^ f I ^n—r)f'fni^n^'^rndZri-, 

^ ^ — 1 


+“ M M 

EE fi{Zn)fj{Zn)UiUjdz„ 

-1 -1 


i = l 7 = 1 

— on 


IV. Non-parametric Filtering 

A. Reducing to optimization problem 

In this section, the transition matrix \\pij\\ is assumed 
unknown, therefore we can not use the equation (doll. To 
overcome this uncertainty we include formula © in equa¬ 
tions ©, © and obtain 

P{Sn = m\Xi)= •Un(TO), (II) 

f{Xn I X^ ) 

M 

f{Xn I X^~^) = ^ fm(Xn)Un(m), (12) 


Un{m) = P{Sn = m \ Xi ^), Vm = l,...,M 
are new variables, which do not depend on Xn and 


Since Ii does not depend on u, then reduce it, also transform 
I 2 and I 3 , so u„ has representation 

u„ = argmin I 3 - 2/2 
uCAm 

MM M 

= argmin ^ ^ CijUiUj - 2 ^ CmUm, (14) 
uGAm 

where 


+ 00 


Cij = J fi{Zn)fj{Zn)dZn, 

— 00 
+ 00 

Cm — / f I 


(15) 

(16) 


M 

Ui = 1 , Um > 0, Vm = 1,..., M. 

i=l 

To calculate (fTTT i and (fT^ it is neccessary to find all 
We need to make the assumption. We suppose that process 
{Sn,Xn) is a-mixing, then 

fixn I Xi~^) Ri fixn \ X^Z].), T £ {1, 2, . . . , n - 1}, 

and estimate density f{Xn \ x^z]-) using kernel density 
estimation and designate this estimator like f{xn \ x^z\)- 
Let us introduce vector u„ = (m„( 1), Un(2),..., u„(M)) 
with unknown elements Un{m),.m = Then for 

calculating u„ one proposes the following estimator 

+“ M 

= argmin / |/(z„ | x^z\) - ^ fm{zn)um?dz^, (13) 
UGAm 

where 

Am = |(fi, < 2 , ■ • ■, iu) S 

M 

iy^L = l,L>0,Vie{l,2,...,M}} 

i=l 


To solve optimization problem (fl4l i. primarily, it is necessary 
to calculate latter coefficients (flSl l and (fTST l. which we will 
obtain using kernel density estimators. Therefore we introduce 
following chapter. 

B. Kernel density estimators 

In the general case kernel density estimator of density / is 
1 ^ 

/(y;H) = -^i^H(y-Y0, (17) 

2=1 

where y = (j/i, ?/ 2 , ■ ■ ■, 2/d)^ is argument and = 

(lii, yi 2 , • ■ •, Yid)^ , i = 1, 2 ,..., are drawn from density 
/; iTH(y) = where K{y) is the multi¬ 

variate kernel, which is probability density function; H G "H 
is the bandwidth matrix and T-L is the set of d x d, symmetric 
and positive-definite matrixes. We propose to use unbiased 
cross-validation (UCV) to find H (univariate case proposed 
in ©, 0 and multivariate in ID, ©). This is a popular and 
relevant method is aimed to estimate 

ISE(H) = J (/(y;H)-/(y))Ey 
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and then minimize resulting function 
UCV(H) 

N N 

= • Ah - 2Ah)(Y. - Y,) 

^ i—l j—1. 

+ ^i?(K)|H|-i/2, (18) 


RiK) = J K{yfdy, 

R'i 

where * denotes a convolution. Then the estimator of H is 


Hucv = argminUCV(H). (19) 

Hew 

We suppose to generate components Yik of vector Yi from 
univariate sample xi, 312 , ■ ■ •, according to the rule 

Yfc — — /l = 1,2,..., d 

where I G N influences on stochastic dependence between 
vectors (for bigger I less dependence). Then we suggest 
to simplify obtaining of estimator (fTTl l and function (fTSl l. For 
this aim we: 

• use normal kernel, it means that we set equal H to d- 
variate normal density with zero mean vector and identity 
covariance matrix ifi; 

• use scalar multiple of identity d x d matrix (I^) for 
bandwidth matrix: 

H = h^U 

Then the estimator (fTTI) becomes 


fiy;h) 


I 

E ivi - 

E i=i 
exp 


N 


\ 


2/l2 


N{2Trf/^h‘^ ^ 
with = 1 + and the estimator of h is 


V 




h = argminUCV(/i), 
h>0 


, ( 20 ) 


( 21 ) 


UCV(h) 


N{N - l)(27r)‘'/2/i^ ^ ^ 2'^/2 

3¥=i 


^ - 2e ^ 


7V(47r)rf/2/jd ’ 


In ||6l it was shown that spurios local minima are more likely 
at too small values of h, so we propose to use golden section 
search between 0 and /i+, where 

u+ f 

d = TTTT-XT max tTfe, 

yA^(d + 2 ) J 


where dk is the sample standard deviation of fc-th elements of 
Yi. The parameter h+ is an oversmoothed bandwidth. If the 
matrix H was an unconstrained then 

1 

4 \ d+4 

N{d + 2)) 

where S is a sample covariance matrix of Y^. The matrix 
H+ is oversmoothed bandwidth in the most cases. The latter 
estimator is proposed in ||7]. To calculate Hycv with uncon¬ 
strained H you may use quasi-Newton minimization algorithm 
like in Q. 



C. Calculation of coefficients Cij and Cm 

For calculating unknown coefficients Cy and Cm in (O we 
use formulas (flST l and (fThl l. Observe that for normal probability 
density function @ following equation 


+ 00 

— 00 

= f{dCd2,crl -f cr|) = +<T2) 

is correct, therefore using it and ® we have 

Cij = / fjzn; fj-ji) + ak{i){xn-k - ji{i)),h^{i)] 

-00 

P 

■ (t>{zn-,d{j) + ^ak{j){Xn-k - ^J,ij)),b^{j)jdz„ 
k^l 

( ^ 

= fuiii) + ^ ak{i)(xn-k - 

fc=i 

fO') + X! afc0’)(x„_fc - + b^{j)j , (22) 

fe=i 

also Cij = Cjj > 0. For calculating Cm we estimate conditional 
density f{zn \ x'^z\) applying dJoli: 


d 

Yxij ^ ^ . 

fc=l 

Computing minima analytically is a challenge, so a numerical 
calculation is popular. The function UCV(h) often has multi¬ 
ple local minima, therefore more correct way is to use brute- 
force search to hnd h, however it is a very slow algorithm. 


fiZn 


X 


n — 1 
n—T 


) 


f{Zn,xl_\.) 

+ (X) 

/ fi.Zn,xlZ].)dZn 

— 00 

N 

~ ^(z—l)/+r+l 5 ^ ); 
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exp 


/3m (t) = 


N 

E exp 


-1 \ 

(^Ti+j—a:(i_i)z+j_|.T+i) 

3 = --r _ 

2/15 

^_L 

^ (^n+j—X(k — l)l+j + T + l)^ 


~2ji^ 


V 


/ 


• for convex optimization KKT conditions, which are pri¬ 
marily necessary, are also sufficient; 
else you may apply methods of quadratic programming. Also 
we want remark that £" does not depend on variables Ui 
and coefficients Cm, which means that previous kernel density 
estiamtors have no influence on kind of optimization. 

Let us consider KKT conditions, then Lagrangian is 


where iV = 1 + bandwidth h is estimated by dlTT l. 

Remark that fini{T) does not depend on z„. Then we substitute 
latter estimator in (fThl l and obtain 

+ 00 

Cm ~ J' f {^n I ^n—r'} 

— oo 

N 

~ I ^ ^ i^ni iy )0(-^n ; 1)^+T+11 ^ ) 

-oo 

p 

■ 4>{zn-, iJiijn) + ^ ak{m){xn-k - 

k=l 

N +“ 

~ 'y ('/") / ^ ) 

i=l -L 

P 

■ (j)(^Zn; nim) + '^akim){xn-k - 

k=l 

N 

— y Pni(T)cj)(x(i — i)i^T+l] 

i=l 

P , 

^(m) + ^ akim){xn-k - At(w)), h'^ + , (23) 

k=l 

also we remark that Cm > 0. 

D. Solution of optimization problem 

In the previous chapters we reduce main problem to opti¬ 
mization problem 

u„ = argminF„(u), 
uCAm 

MM M 

^ ^ ^ ^ CijUiUj 2 ^ ^ 
i=l j—1 m—1 

where coefficients Cy and Cm were calculated in (l22l i and (l23t . 
Let us consider kind of optimization. We have that Am is 
convex set and Hessian matrix of function Fn{s) is 

f Cii Ci2 ... CiM \ 

C21 C22 • • ■ C2M 

= : : .. : ■ 

\CM1 CM2 ■ • ■ CmM j 

If £" is positive defined, then Fn{s) is convex, thus we have 
convex optimization. In this case we propose to use Karush- 
Kuhn-Tucker (KKT) conditions JS], |0, because of; 

• our case is special because there is opportunity to solve 
KKT conditions analytically; 


M 


Z M 


£ — '^o£n(tt) F y ^ t/i) “t“ Xm+1 ^ ^ ^ 


i=l 


where A* = (Aq, A^,..., A^^^) G We need to find 

A* and u* such that stationary condition 


M 


£'. = 2a;; 


I -a: + ai,+i = o, 


vi=i 


V* = 


primal feasibility 


-M-<0, Vf=l,...,M 


M 




dual feasibility 

A*>0, Vi = l,...,M 
complementary slackness 

A*<=0, V^=1,...,M 

hold. Let Aq = 0 to check that the gradients of constraints are 
linearly independent at u*, so KKT conditions lead to system 

\* _ \* _ _ 

— ^2 — ... — 

A*u* = 0, A*>0, Vf = l,...,M 


I I 

M 

E u* = 1, u* > 0, 

, i=l 




which could be solved only with A* = 0, which means that 
gradients of constraints are linearly independent for any u*. 
The vector A* is defined with an accuracy of a > 0, so we 
define Aq = 1/2, then KKT conditions lead to a system 

C • p = c, 

where 


C = 


/ Cll 

Cl2 

CiM 

-1 

0 ••• 

0 

1\ 

C21 

C22 

C2M 

0 

_1 ... 

0 

1 

cmi 

CM2 

• • • cmm 

0 

0 ••• 

-1 

1 

V1 

1 

1 

0 

0 ••• 

0 

0/ 


/ \ 


P = 




C = 


/ci\ 

C2 


Cm 

Vi/ 


V'^M+l/ 

\*U* = 0, A* > 0, < > 0. V* = 1,..., M 
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TABLE I 

Sample Mean Errors 



Filtering error, % 

Prediction error, % 

Optimal 

16.4 

26.6 

Non-parametric 

22.7 

37.6 


To solve last system it is necessary to consider all combina¬ 
tions of pairs (u*,A*), Vi = where u* or A* is 

equal to 0 (not both). Total amount of combinations is equal 
to 2^. If u* = 0 then i-th column in the matrix C and i-th 
row in p are reduced, else A* = 0 and (M + 2 )-th column 
in the matrix C and (M + 2 )-th row in p are reduced. After 
choosing zero element in each pair {u*, A*), Vi = 1 ,..., M 
matrix C is reduced to an (M + 1) x [M + 1)-matrix and 
p to (M -b 1) X 1-matrix pr- Therefore for each combination 
it is necessary to calculate 

Pr = • C. 

If the first M elements in pr are non-negative then obtained 
u* is a solution (u„) of optimization problem and there is no 
reason to calculate pr for the next combination, because in 
convex optimization local minima is global minima. 

As a result, we substitute estimator u„ in (fTTTi and (fTSli and 
problem of non-parametric filtering is solved. 

V. One-step Ahead Prediction 

We will consider one-step ahead prediction. Like for filter¬ 
ing we minimize mean risk E{L{Sn, Sn)) with simple loss 
function (|2]i. Therefore optimal estimator of is 

Sn = argmax Pr(5'„ = m \ 

We remark that probabilty Pr(5'„ = m \ is already 

obtained in the considered approaches of filtering: for opti¬ 
mal prediction it is written in (doll and for non-parametric 
prediction accordingly in di. It means that we primarily 
solve problem of one-step ahead prediction and then filtering 
problem. 


VI. Example 


Let the Markov chain (S'„) has 3 states (M = 3) and 
transition matrix 


lb.,2II 


0.8 

O.I 

O.I A 


0.05 

0.9 

0.05 . 

(24) 

O.I 

0.05 

0.85/ 



Sample volume n is changed from 500 to 600. Observable 
process (Xn) is simulated like AR(2) model with coefficients 
p G {0,0.5,!}, ai G {0.3,0.2,0.11, 02 G {0.2,0.3, 0.4}, 
b G {O.1,0.2,0.I}. Also we take r = 2 and I = I. The 
results are presented in Fig. [T] and sample mean errors after 
50 repeated experiments in Table |T] 



500 510 520 530 540 550 560 570 580 590 600 



500 510 520 530 540 550 560 570 580 590 600 

3 ■ 

2 


1 

500 510 520 530 540 550 560 570 580 590 600 

3 
2 



u 

II II 

1 u 

bur 1 

1 


500 510 520 530 540 550 560 570 580 590 600 


Fig. 1. From top to bottom: 1 — unobservable Sn', 2 — observable Xn', 
3, 4 — optimal and non-parametric filtering; 5, 6 — optimal and non- 
parametric prediction. 
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